protein structure prediction
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
A Supplementary materials
A.2 Documentation and intended uses We include a datasheet [1] in Section B. Detailed documentation on the precise structure and content OpenProteinSet is made available under the CC BY 4.0 license. The authors bear all responsibility in case of violation of rights. OpenProteinSet will continue to be hosted on RODA for the foreseeable future. A.7 Alignment tool settings For JackHMMer, we used -N 1 -E 0.0001 -incE 0.0001 -F1 0.0005 -F2 0.00005 -F3 0.0000005 and then capped outputs at depth 5000. B.1 Motivation For what purpose was the dataset created?
- Health & Medicine > Pharmaceuticals & Biotechnology (0.72)
- Law (0.47)
MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high-quality MSA. Although various methods have been proposed to generate high-quality MSA under these conditions, they fall short in comprehensively capturing the intricate co-evolutionary patterns within MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompt protein structure predictions via MSA generative pre-training in a low-MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model the complex evolutionary patterns.
OpenProteinSet: Training data for structural biology at scale
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization
Liu, Zijing, Feng, Bin, Cao, He, Li, Yu
Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural "language". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state-of-the-art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near-instantaneous method for modeling protein dynamics. Source code is available in https://github.com/IDEA-XL/TokenMD.
- Asia > China > Guangdong Province > Shenzhen (0.41)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Hong Kong (0.04)
The Role of AI in Facilitating Interdisciplinary Collaboration: Evidence from AlphaFold
Zhao, Naixuan, Wei, Chunli, Zhang, Xinyan, Li, Jiang
The acceleration of artificial intelligence (AI) in science is recognized and many scholars have begun to explore its role in interdisciplinary collaboration. However, the mechanisms and extent of this impact are still unclear. This study, using AlphaFold's impact on structural biologists, examines how AI technologies influence interdisciplinary collaborative patterns. By analyzing 1,247 AlphaFold-related papers and 7,700 authors from Scopus, we employ bibliometric analysis and causal inference to compare interdisciplinary collaboration between AlphaFold adopters and non-adopters. Contrary to the widespread belief that AI facilitates interdisciplinary collaboration, our findings show that AlphaFold increased structural biology-computer science collaborations by just 0.48%, with no measurable effect on other disciplines. Specifically, AI creates interdisciplinary collaboration demands with specific disciplines due to its technical characteristics, but this demand is weakened by technological democratization and other factors. These findings demonstrate that artificial intelligence (AI) alone has limited efficacy in bridging disciplinary divides or fostering meaningful interdisciplinary collaboration.
- North America > United States (0.14)
- Asia > China > Hong Kong (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)